In [1]:
from __future__ import division, print_function, unicode_literals

%matplotlib inline

import os

import IPython.display
import numpy as np
import matplotlib.pyplot as plt

import data_io
import json_io
import utilities

import requests
import BeautifulSoup
import arrow
import twython
import textblob

# Keyboard shortcuts: http://ipython.org/ipython-doc/stable/interactive/notebook.html#keyboard-shortcuts

Task Objective

My objective for this effort is to demonstrate my data processing and analysis capabilities outside of my traditional hyperspectral remote sensing work. I have played in that arena for a long time and picked up a good number of modeling and analysis skills. The present effort is meant to be a quick example of processing unfamiliar data using new tools and protocols. This work needs to be quick, efficient, and have a clear punch line. This notebook is where I plan to explore these tools and the data they help me fetch. I'll use another notebook for the analysis once all the data is sorted out.

I made a statement a few days ago indicating that I would like to solve new types of problems. For example, I might treat a new movie as a collection of word feature vectors pulled out of a Twitter feed. I would then make statistical associations with other movies having known performance characteristics such as viewer retention and engagement. The validity of the association process could be verified by testing with labelled data. Results from such a process might be useful for someone's planning efforts.

Early Morning Thoughts

TextBlob

Very, very early this morning I could not sleep as I kept thinking about this little task, turning over in my head the details of the objective in the section above. I need a way to make sense of text describing a movie. A quick search on GitHub turned up this great Python package: TextBlob. The text from the website says:

A library for processing textual data. It provides a simple API for diving into common natural
language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation, and more.

Behind the scenes it uses the packages NLTK and pattern. I haven't done much at all with natural language text processing, but this tool looks like a great place to start! TextBlob will return two metrics describing the sentiment of a chunk of text: Polarity and Subjectivity. Those two numbers will be a great starting point for visualizing this stuff.
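
Here's a minimal sketch of what those two numbers look like in practice (the example sentence is just made up):

from textblob import TextBlob

blob = TextBlob('The sequel was surprisingly good, though a bit too long.')

# sentiment is a namedtuple: polarity in [-1, 1], subjectivity in [0, 1].
print(blob.sentiment.polarity)
print(blob.sentiment.subjectivity)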

Open Movie Database API: IMDb and RottenTomatoes

Next, I found a nice web API for querying information from both IMDb and RottenTomatoes: The OMDb API. This site used to reside at another address, http://imdbapi.com/, but not anymore. Take a look over there for an interesting writeup of the site owner's interaction with IMDb.com's lawyers. Very clever! Anyhow, you can use this service to very easily search for movie information pulled from IMDb and RottenTomatoes.

Twitter API

I found several Python packages on GitHub for working with Twitter's API service; here are the two that seem best maintained: https://github.com/geduldig/TwitterAPI and https://github.com/ryanmcgrath/twython. Just from reading over each package, I really like Twython's minimalistic interface. I will probably go with that.

Sign up for the Twitter developer API at https://dev.twitter.com/apps. I named my app MovieInfoPierreDemo. That name is so goofy, but I felt rushed! Once it's set up I need to grab the Consumer Key and the Consumer Secret. It's a bit confusing sorting through all the authentication options. The Twython documentation finally had great advice if all one cares about is read access to Twitter: use OAuth 2!

Some Interesting Movie Lists

Next I wanted a list of interesting movies to play with. I found this list of titles from 2012 for sequels of popular movies: Sequel Movies 2012. I figure I'll need to manipulate some of that data by hand just to get it done quickly. I would normally write some code to automate this step, but right now this is a one-time deal.

Comparing Words

Once the data is all assembled into a usable form, my plan is to compare words using the Bag-of-Words approach. This involves computing histograms of word frequencies for some ensemble of words (e.g. words collected from tweets). There are several ways to compare histograms with the goal of computing a similarity metric. My favorite is the Earth Mover's Distance (EMD). It's like this: given two different histograms with the same bins, think of the two distributions as two piles of dirt. The EMD metric is then the minimum amount of work a bulldozer would have to do in order to make one pile of counts look like the other. Last year I wrapped up this Fast C++ EMD implementation as a Python extension for a work project.
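
As a tiny illustration of the histogram half of that idea (the EMD half additionally needs a ground distance between bins, which I get to next):

from collections import Counter

def word_histogram(text):
    """Count word frequencies in a lowercased, whitespace-split chunk of text."""
    return Counter(text.lower().split())

h1 = word_histogram('the hobbit was great great fun')
h2 = word_histogram('the hobbit was fun')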

In that project the distance between bins was simple: just the Euclidean distance between them in bin space. But in this task I am dealing with words as labels for each bin. There is no physical meaning associated with which word is represented in the adjacent bin. The words might be sorted alphabetically, or by size, or just at random.

About ten years ago I implemented the Levenshtein Distance in IDL, way before I ever started using Python. I could have translated that older IDL version over to Python, but it was actually easier to just go Googling for a Python implementation. One of the first results that came back was py-editdist. That's what I love about Python: if you need a new function there's a good chance somebody has already implemented something similar and made their repo publicly available.
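
For illustration, here is a compact pure-Python version of the same dynamic-programming idea (py-editdist wraps a fast C implementation):

def levenshtein(a, b):
    """Minimum number of single-character edits transforming a into b."""
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

assert levenshtein('kitten', 'sitting') == 3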

Random Ideas

  1. Maybe I can show correlation between Twitter data and IMDb / RottenTomatoes reviews.
  2. It's not clear if EMD will be all that useful here.
  3. Compute sentiment from before/after Twitter data. Change in values after release date?
  4. Potential to show temporal dependence from Twitter data, assuming sufficient volume of tweets per day or week.
  5. Curious idea for visualizing temporal sentiment: each week compute a 2D sentiment vector averaged over tweets, then concatenate the vectors into the form of a chain.
  6. Anything to be done with Mutual Information?
  7. I am going to focus on just the sentiment stuff from TextBlob. Forget about the EMD / Bag-of-Words stuff for now.

Work Plan

Given all that brainstorming above, let's make a plan of action!

I am going to focus on acquiring data from various sources and aggregate it into a form suitable for visualization and analysis. I don't think I'll have enough time for any exhaustive analysis. The most I want to get done then is generating a nice visualization.

  1. Install necessary software packages on my laptop.
  2. Get Twitter feed connected and running: Oauth2, API key, etc.
  3. Browse through the Sequel Movies 2012 web site and make a list of interesting movies to play with.
  4. Automate looking up movie details from the OMDb site. I want the IMDb ID plus RottenTomatoes viewer feedback. Use the Requests package.
  5. Work with the Twitter API: search for tweets about these movies over two different time periods: the months just prior to, and just after, the release date.
  6. Aggregate Twitter feed text for each movie (keeping before and after sets as separate collections).
    1. what about a time period just after the sequel is announced?
  7. Apply TextBlob methods to text data pulled from RottenTomatoes and from Twitter search results.
  8. Generate graphics (scatterplots) showing before and after sentiment(s) for original movie and sequel.
  9. Stop here for now.

List of Interesting Movies

I made a short list of recent movies and stored the name and year of release in a simple YAML text file.
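
The file might look something like this (hypothetical contents; the real list lives in movies.yml):

- name: The Hunger Games
  year: 2012
- name: "The Hunger Games: Catching Fire"
  year: 2013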


In [2]:
fname_movies = 'movies.yml'

# Run this set of lines to view the text contents of my movie file.  Or just open up the file in your text editor.
# with open(fname_movies) as fi:
#      for v in fi.readlines():
#          print(v.rstrip())

info_movie_list, meta = data_io.read(fname_movies)

for item in info_movie_list:
    print('{0} ({1})'.format(item['name'], item['year']))


The Hunger Games (2012)
The Hunger Games: Catching Fire (2013)
The Hobbit: An Unexpected Journey (2012)
The Hobbit: The Desolation of Smaug (2013)
Diary of a Wimpy Kid: Rodrick Rules (2011)
Diary of a Wimpy Kid: Dog Days (2012)
The Expendables (2010)
The Expendables 2 (2012)

Movie Details from IMDb and RottenTomatoes via OMDb API

As mentioned above, the OMDb API is a nice web service for querying movie information from both IMDb and RottenTomatoes.

Let's use this service to search for movie information, fetching data with the Requests package in two steps. The first is to determine the IMDb ID number for each movie in the list. With that ID we can then pull down additional details, including the RottenTomatoes fields.

Note: I just found out that RottenTomatoes has their own API. I just now signed up for an API key and have read quickly through the documentation. It looks simple and easy to use. If there's time I'll probably switch over to that instead of OMDb.


In [3]:
omdbapi_url = 'http://www.omdbapi.com'

movie_name = info_movie_list[1]['name']  # index 1 should yield The Hunger Games: Catching Fire (2013).
movie_year = info_movie_list[1]['year']

params = {'t': movie_name, 'y': movie_year, 'tomatoes': True}

response = requests.get(omdbapi_url, params=params)
info_omdb = response.json()

# Use IPython's builtin display function for nicely-formatted view of the response data.
IPython.display.display(info_omdb)


{u'Actors': u'Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Philip Seymour Hoffman',
 u'Awards': u'N/A',
 u'BoxOffice': u'$335.9M',
 u'Country': u'N/A',
 u'DVD': u'N/A',
 u'Director': u'Francis Lawrence',
 u'Genre': u'Action, Adventure, Sci-Fi, Thriller',
 u'Language': u'N/A',
 u'Metascore': u'N/A',
 u'Plot': u'Katniss Everdeen and Peeta Mellark become targets of the Capitol after their victory in the 74th Hunger Games sparks a rebellion in the Districts of Panem.',
 u'Poster': u'http://ia.media-imdb.com/images/M/MV5BMTAyMjQ3OTAxMzNeQTJeQWpwZ15BbWU4MDU0NzA1MzAx._V1_SX300.jpg',
 u'Production': u'Lionsgate Films',
 u'Rated': u'PG-13',
 u'Released': u'22 Nov 2013',
 u'Response': u'True',
 u'Runtime': u'2 h 26 min',
 u'Title': u'The Hunger Games: Catching Fire',
 u'Type': u'movie',
 u'Website': u'http://www.thehungergamesexplorer.com/us/',
 u'Writer': u'Simon Beaufoy, Michael Arndt',
 u'Year': u'2013',
 u'imdbID': u'tt1951264',
 u'imdbRating': u'8.3',
 u'imdbVotes': u'69,981',
 u'tomatoConsensus': u"Smart, smoothly directed, and enriched with a deeper exploration of the franchise's thought-provoking themes, Catching Fire proves a thoroughly compelling second installment in the Hunger Games series.",
 u'tomatoFresh': u'205',
 u'tomatoImage': u'certified',
 u'tomatoMeter': u'90',
 u'tomatoRating': u'7.5',
 u'tomatoReviews': u'229',
 u'tomatoRotten': u'24',
 u'tomatoUserMeter': u'93',
 u'tomatoUserRating': u'4.4',
 u'tomatoUserReviews': u'228,552'}
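
Note that the response includes the movie's imdbID, so the same endpoint can also be queried by ID. A minimal sketch reusing the names defined above (the 'i' parameter is OMDb's ID lookup field):

params = {'i': info_omdb['imdbID'], 'tomatoes': True}
info_by_id = requests.get(omdbapi_url, params=params).json()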

Notice that some of the entries have characters encoded with ampersand (HTML entity) encoding. I would like to decode all such occurrences back to regular text, and I found a nice implementation of just such a function over at stackoverflow.com. I am calling it from my utilities.py module. The next cell fixes all occurrences of these encodings.


In [4]:
# Undo any ampersand encoding in the text returned from OMDbAPI.com.
for k in info_omdb.keys():
    info_omdb[k] = utilities.decode(info_omdb[k])
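
For reference, a minimal sketch of what a decode function like that might look like (hypothetical; the real one lives in utilities.py):

import HTMLParser

_parser = HTMLParser.HTMLParser()

def decode(value):
    """Convert HTML entities such as &amp; back to plain text; pass non-strings through."""
    if isinstance(value, basestring):
        return _parser.unescape(value)
    return value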

So! There's a lot of stuff in that response data, but for now I'm going to focus on just a few pieces of data: the text strings corresponding to the keys Plot and tomatoConsensus, the viewer ratings from IMDb and RottenTomatoes, plus the date the movie was released to theaters. Just to keep things simple here, I am going to copy out only the handful of fields I care about, and take care of date parsing at the same time.

As far as dates go, I really, really dislike using Python's builtin date and time tools. The good news is there now exists a much better choice for working with dates: Arrow. From the web site:

Arrow is a Python library that offers a sensible, human-friendly approach to creating, manipulating, formatting and converting dates, times, and timestamps. [...] Arrow is heavily inspired by moment.js and python-requests.

See here http://crsmithdev.com/arrow/#format and here http://crsmithdev.com/arrow/#tokens for date/time format details.
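
A quick taste of the round trip, using the format tokens described at those links:

t = arrow.get('22 Nov 2013', 'DD MMM YYYY')
print(t.format('YYYY-MM-DD'))  # 2013-11-22
print(t.humanize())            # e.g. '2 weeks ago'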


In [5]:
# Date format used by OMDb's 'Released' field ('fmt' avoids shadowing the builtin format()).
fmt = 'DD MMM YYYY'
date_released = arrow.get(info_omdb['Released'], fmt)

# print('date_released: ', date_released.year, date_released.month, date_released.day)

info_movie = {'Title': info_omdb['Title'],
              'Plot': info_omdb['Plot'],
              'tomatoConsensus': info_omdb['tomatoConsensus'],
              'Released': date_released,
              'imdbRating': float(info_omdb['imdbRating']),
              'tomatoRating': float(info_omdb['tomatoRating'])}
             
IPython.display.display(info_movie)


{u'Plot': u'Katniss Everdeen and Peeta Mellark become targets of the Capitol after their victory in the 74th Hunger Games sparks a rebellion in the Districts of Panem.',
 u'Released': <Arrow [2013-11-22T00:00:00+00:00]>,
 u'Title': u'The Hunger Games: Catching Fire',
 u'imdbRating': 8.3,
 u'tomatoConsensus': u"Smart, smoothly directed, and enriched with a deeper exploration of the franchise's thought-provoking themes, Catching Fire proves a thoroughly compelling second installment in the Hunger Games series.",
 u'tomatoRating': 7.5}

Rotten Tomatoes API


In [6]:
uri_base = 'http://api.rottentomatoes.com/api/public/v1.0'
uri_home = 'http://api.rottentomatoes.com/api/public/v1.0.json?apikey=mfy52ff3xbgcdwxqr9fwvjw9'
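
Nothing built on this yet. For later, a hypothetical sketch of a movie search against the v1.0 endpoint (parameter names taken from the public documentation, not verified here):

params = {'apikey': 'my-api-key', 'q': 'The Hunger Games', 'page_limit': 5}
resp = requests.get(uri_base + '/movies.json', params=params)
info_rt = resp.json()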

Twitter's Search API

I'm following the very helpful instructions from the Twython documentation and using my own general-purpose data_io module for storage. This part is so much easier if all you need is access to the search API, and not access to any personal info.


In [7]:
# Set this flag to True when you need to generate a new Twitter API access token.
flag_get_new_token = False

fname_twitter_api = 'twitter_api.yml'

# Load Twitter API details.
info_twitter_api, meta = data_io.read(fname_twitter_api)

if 'access_token' not in info_twitter_api:
    flag_get_new_token = True

if flag_get_new_token:
    # Use my Twitter dev API credentials to fetch a new access token.
    twitter = twython.Twython(info_twitter_api['consumer_key'], info_twitter_api['consumer_secret'], oauth_version=2)

    print('Fetching new token...')
    access_token = twitter.obtain_access_token()

    # Store the token for later use.
    info_twitter_api['access_token'] = access_token
    data_io.write(fname_twitter_api, info_twitter_api)

    print('New token stored: {:s}'.format(fname_twitter_api))

# Build the API client with the stored (possibly freshly-fetched) access token.
twitter = twython.Twython(info_twitter_api['consumer_key'], access_token=info_twitter_api['access_token'])

# This little try/except section is the only way I know (so far) to determine if I have a valid Twitter access token
# when using OAuth 2.
try:
    temp = twitter.get_application_rate_limit_status()
except twython.TwythonAuthError:
    msg = 'Boo hoo, you may need to regenerate your access token.'
    raise twython.TwythonError(msg)

API Current Status Query

Just for the fun of it, let's print out some interesting tidbits about the current status of my Twitter API key. This includes some version numbers and the current status of various rate limits. If you are going to be calling the API frequently you might run up against these rate limits. Watch out!


In [8]:
print('\nApp key:    {}'.format(twitter.app_key))
print('OAuth version:  {}'.format(twitter.oauth_version))
print('API version:    {}'.format(twitter.api_version))
print('Authenticate URL: {}'.format(twitter.authenticate_url))

# Rate limit stats.
info_rate = twitter.get_application_rate_limit_status()

# Application limits.
n_limit = info_rate['resources']['application']['/application/rate_limit_status']['limit']
n_remaining = info_rate['resources']['application']['/application/rate_limit_status']['remaining']
t_reset = info_rate['resources']['application']['/application/rate_limit_status']['reset']

delta = arrow.get(t_reset) - arrow.now() 
t_wait = delta.seconds/60

print()
print('Application limit:     {} requests'.format(n_limit))
print('Application remaining: {} requests'.format(n_remaining))
print('Application wait time: {:.1f} min.'.format(t_wait))

# Search limits.
n_limit = info_rate['resources']['search']['/search/tweets']['limit']
n_remaining = info_rate['resources']['search']['/search/tweets']['remaining']
t_reset = info_rate['resources']['search']['/search/tweets']['reset']

delta = arrow.get(t_reset) - arrow.now() 
t_wait = delta.seconds/60

print()
print('Search limit:          {} requests'.format(n_limit))
print('Search remaining:      {} requests'.format(n_remaining))
print('Search wait time:      {:.1f} min.'.format(t_wait))


App key:    JxUV7dXAvXigyxyWafOGUA
OAuth version:  2
API version:    1.1
Authenticate URL: https://api.twitter.com/oauth/authenticate

Application limit:     180 requests
Application remaining: 178 requests
Application wait time: 14.9 min.

Search limit:          450 requests
Search remaining:      440 requests
Search wait time:      11.5 min.

Follow the search instructions for OAuth 2 on the Twython site; Twitter's own API documentation filled in the gaps.

I also found a nice IPython notebook online showing an example using Twython; unfortunately, it was written for the older version 1.0 of the Twitter API, and the implementation details have changed with Twitter's API version 1.1.

The page Help with the Search API has this helpful tidbit of information for when you expect a large number of returned tweets. In this case it is important to pay attention to how you iterate through the results:

Iterating in a result set: parameters such count, until, since_id, max_id allow to control how we iterate through search results, since it could be a large set of tweets. The 'Working with Timelines' documentation is a very rich and illustrative tutorial to learn how to use these parameters to achieve the best efficiency and reliability when processing result sets.

Ok, now I've read through the Twython package documentation and the source code. The authors of this package have fully taken into account the advice above given by Twitter. The way forward here is to use the instance method Twython.cursor. My wrapper is now a lot simpler than what I had earlier this afternoon! Woo!
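
In miniature, the pattern looks like this (the query string is just an example):

# cursor() returns a generator and transparently follows Twitter's paging
# parameters (max_id, etc.) behind the scenes.
results = twitter.cursor(twitter.search, q='"Catching Fire"')
for tweet_json in results:
    print(tweet_json['text'])
    break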

Contents of a Tweet

Here is an example of the JSON contents of a tweet returned through the Twitter API.

{'contributors': None,
 'coordinates': None,
 'created_at': 'Sat Dec 28 17:26:58 +0000 2013',
 'entities': {'hashtags': [],
              'symbols': [], 
              'urls': [],
              'user_mentions': [{'id': 848116975,
                                  'id_str': '848116975',
                                  'indices': [42, 58],
                                  'name': 'ZZZZZZZ',
                                  'screen_name': 'QQQQQQQ'},
                                 {'id': 2202651295,
                                  'id_str': '2202651295',
                                  'indices': [59, 73],
                                  'name': 'XXXXX',
                                  'screen_name': 'YYYYY'}]},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 416983627935670272,
 'id_str': '416983627935670272',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'lang': 'en',
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'place': None,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'text': 'Going to see the Desolation of Smaug with @AllenWellingto1 @hannahkovacs3',
 'truncated': False,
 'user': { XXXXX }}

Some helper code.

What follows in the next cell are a few helper classes and functions to make future work easier and more fun.


In [15]:
class Tweet(object):
    def __init__(self, json):
        self.json = json
        
    @property
    def has_url(self):
        return bool(self.json['entities']['urls'])
    
    @property
    def url_title(self):
        """Return title of first URL page, if URL exists.
        """
        
        if self.has_url:
            # Grab the first (expanded) URL.
            url = self.json['entities']['urls'][0]['expanded_url']
            
            resp = requests.get(url)        
            soup = BeautifulSoup.BeautifulSoup(resp.content)
            results = soup.title.string
        else:
            results = None
            
        return results
        
    @property
    def is_retweet(self):
        """Indicate if this tweet is a retweet.
        https://dev.twitter.com/docs/platform-objects/tweets
        """
        return 'retweeted_status' in self.json
        
    @property
    def text(self):
        results = self.json['text']

        # Check to see if there are any URLs embedded in the text.
        if self.json['entities']['urls']:
            # Crop the text at the start of the first URL.
            ixs = self.json['entities']['urls'][0]['indices']
            results = results[:ixs[0]]
            
        return results
    
    @property
    def id(self):
        """Twitter tweet ID.
        """
        return int(self.json['id_str'])
    
    @property
    def timestamp(self):
        """Time when Tweet was created.

        # e.g. Sat Dec 28 16:56:41 +0000 2013'
        """

        format = 'ddd MMM DD HH:mm:ss Z YYYY'
        stamp = arrow.get(self.json['created_at'], format)
    
        return stamp


    def to_file(self, fname):
        """Serialize this Tweet to a JSON file.
        """
        b, e = os.path.splitext(fname)
        fname = b + '.json'
        
        json_io.write(fname, self.json)

    
    @staticmethod
    def from_file(fname):
        """Instanciate a Tweet object from previously-serialized Tweet.
        """
        b, e = os.path.splitext(fname)
        fname = b + '.json'

        json = json_io.read(fname)
        
        tw = Tweet(json)
        return tw
            
#######################################################################

            
def search_gen(query, since_id=None, until=None, lang='en', **kwargs):
    """Generator yielding individual tweets matching the supplied query string.

    Parameters
    ----------
    query : str, Twitter search query, e.g. "python is nice".
    since_id : int, only return tweets with IDs more recent than this value.
    until : date string formatted as 'YYYY-MM-DD'.
    lang : restrict results to this language code, e.g. 'en'.

    """
    gen = twitter.cursor(twitter.search, q=query, since_id=since_id, until=until, lang=lang, **kwargs)
    
    for json in gen:
        tw = Tweet(json)

        # Check each tweet for crap.
        is_crappy = tw.is_retweet or tw.has_url

        if not is_crappy:
            yield tw

Let's try running a quick practice query for recent tweets. Notice below that I am also searching for any URLs in the text. I use a combination of Requests and BeautifulSoup to fetch the title of whatever page is at the other end of each such URL. In addition to the text of the actual tweet, I also want the time (UTC), date, and geographic location.

Use id_str instead of id. See this discussion for details https://groups.google.com/forum/#!topic/twitter-development-talk/ahbvo3VTIYI.


In [16]:
# Practice search on a topic and extracting information from returned tweets. 
q = 'Star Wars'

num_max = 15
gen = search_gen(q)

for k, tw in enumerate(gen):
    print('\ntweet: {:d}'.format(k))
    print('id: {:d}'.format(tw.id))
    print(tw.text)
    
    if k > num_max:
        break
        
    # Save tweet to a JSON file, e.g.:
    # tw.to_file('tweet_{:d}'.format(tw.id))


tweet: 0
id: 417003839670484993
An hour ago i was watching Star Wars with no hope and now I am flying off the walls with a dangerous amount

tweet: 1
id: 417003834204893184
Where do prople buy Star Wars stuff? In a Star Wars store BY: Harry Styles

tweet: 2
id: 417003830514298880
This year's parody is 'Shakespeare does Star Wars' http://t.co/WO82I1TBxP

tweet: 3
id: 417003815435767808
'bout to watch Star Wars, @Lunsfuhd would be proud!

tweet: 4
id: 417003803473620992
Star Wars! ☺️👌✨

tweet: 5
id: 417003783793942528
@lheidensohn97 err they aren't cool enough to be star wars ://

tweet: 6
id: 417003753255235584
Watching Star Wars with my son for the 1st time...

#GLUED

tweet: 7
id: 417003839670484993
An hour ago i was watching Star Wars with no hope and now I am flying off the walls with a dangerous amount

tweet: 8
id: 417003834204893184
Where do prople buy Star Wars stuff? In a Star Wars store BY: Harry Styles

tweet: 9
id: 417003830514298880
This year's parody is 'Shakespeare does Star Wars' http://t.co/WO82I1TBxP

tweet: 10
id: 417003815435767808
'bout to watch Star Wars, @Lunsfuhd would be proud!

tweet: 11
id: 417003803473620992
Star Wars! ☺️👌✨

tweet: 12
id: 417003783793942528
@lheidensohn97 err they aren't cool enough to be star wars ://

tweet: 13
id: 417003753255235584
Watching Star Wars with my son for the 1st time...

#GLUED

tweet: 14
id: 417003839670484993
An hour ago i was watching Star Wars with no hope and now I am flying off the walls with a dangerous amount

tweet: 15
id: 417003834204893184
Where do prople buy Star Wars stuff? In a Star Wars store BY: Harry Styles

tweet: 16
id: 417003830514298880
This year's parody is 'Shakespeare does Star Wars' http://t.co/WO82I1TBxP

In [11]:
tw


Out[11]:
<__main__.Tweet at 0x5a36950>

In [12]:
t = Tweet(tw)

In [13]:
t.is_retweet


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-dad0e516f765> in <module>()
----> 1 t.is_retweet

<ipython-input-9-2e2f18896150> in is_retweet(self)
     29         https://dev.twitter.com/docs/platform-objects/tweets
     30         """
---> 31         return 'retweeted_status' in self.json
     32 
     33     @property

TypeError: argument of type 'Tweet' is not iterable

That error is my own fault: tw is already a Tweet instance, so wrapping it again hands the constructor a Tweet instead of the raw JSON dict it expects.

What about TextBlob?

